Price purpase on 4, 6, 15, 21 and 30 days prior de departure have the best correlation with number of passengers.
Next, checking those variables for wide enough spread to be significant and avoid overfitting.
Keep Day 4, 15, 21 and 30 but prefer Day 6 over Day 7 for its variablility
##
## Call:
## lm(formula = Revenue.passengers ~ Day_4 + Day_6 + Day_15 + Day_21 +
## Day_30, data = dfa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -334075 -82093 -19457 122307 345473
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1893414 530668 3.568 0.00257 **
## Day_4 -182728 1389904 -0.131 0.89704
## Day_6 -2893690 1974663 -1.465 0.16218
## Day_15 7766905 5370514 1.446 0.16742
## Day_21 -2909230 5940073 -0.490 0.63095
## Day_30 4281765 2816676 1.520 0.14799
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 193600 on 16 degrees of freedom
## Multiple R-squared: 0.6089, Adjusted R-squared: 0.4867
## F-statistic: 4.983 on 5 and 16 DF, p-value: 0.006102
## (Intercept) Day_4 Day_6 Day_15 Day_21 Day_30
## 1893413.5 -182727.6 -2893690.0 7766904.6 -2909229.5 4281764.6
Model plotted above has a very significant predictive model. Next step predicting Quarterly Revenue. Note: It is possible that this airline has a very predictive seasonal traffic that would explain such a blattant correlation.
dfa <- try(read.csv( paste0(basepath,"/data/B6_quarterly_finance_ppm.csv") , header=TRUE, sep = "," ,stringsAsFactors=FALSE) )
dfa$Revenue <- as.numeric(as.character(dfa$Revenue))
dfa$Quarter_ending <- as.Date(dfa$Quarter_ending, "%m/%d/%y")
#pairs(dfa[,c(2,4,5,8,9,10,11,13,15,16,17)], pch = 19)
fit <- lm(Revenue ~ Day_4+Day_7+Day_15+Day_30+Day_21, data = dfa )
summary(fit)
##
## Call:
## lm(formula = Revenue ~ Day_4 + Day_7 + Day_15 + Day_30 + Day_21,
## data = dfa)
##
## Residuals:
## 1 2 3 4 5 6 7 8
## -7.6096 0.6027 1.2713 -38.6544 25.8094 32.4321 -8.1616 -5.6899
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 802.1 409.0 1.961 0.189
## Day_4 1887.4 850.0 2.221 0.157
## Day_7 -2303.9 3878.9 -0.594 0.613
## Day_15 3931.8 4169.5 0.943 0.445
## Day_30 1028.4 1222.0 0.842 0.489
## Day_21 -2710.1 3311.9 -0.818 0.499
##
## Residual standard error: 41.05 on 2 degrees of freedom
## Multiple R-squared: 0.9136, Adjusted R-squared: 0.6977
## F-statistic: 4.231 on 5 and 2 DF, p-value: 0.2022
pre <- predict( fit, dfa)
dfa$pre <- pre
acu <- data.frame(cbind(actuals=dfa$Revenue, predicteds=pre))
acu
## actuals predicteds
## 1 1478 1485.610
## 2 1487 1486.397
## 3 1571 1569.729
## 4 1477 1515.654
## 5 1451 1425.191
## 6 1650 1617.568
## 7 1623 1631.162
## 8 1564 1569.690
plot(acu)
coef(fit)
## (Intercept) Day_4 Day_7 Day_15 Day_30 Day_21
## 802.076 1887.418 -2303.878 3931.789 1028.374 -2710.069
bin<-hexbin(dfa$Revenue, pre, xbins=3)
plot(bin, main="Hexagonal Binning")
PRETTY GOOD, NO?
Chart below is representing normalized (around their means) the fields: revenue, prediction, day 0 to 180
Note: Time serie is per quarter… January 2018 is in fact 2017 Q4 …